| Instruction          | Pipeline stages, clock cycles |
|----------------------|-------------------------------|
| main:                |                               |
| daddui r1,r0,0       | FDEMW 5                       |
| daddui r2,r0,100     | FDEMW 1                       |
| loop:                |                               |
| l.d f1,v1(r1)        | FDEMW 1                       |
| l.d f2,v2(r1)        | FDEMW 1                       |
| div.d f4,f1,f2       | FDSdddddddddMW 11             |
| s.d f4,v4(r1)        | FSDESSSSSSSSMW 1              |
| l.d f3,v3(r1)        | FDSSSSSSSSEMW 1               |
| mul.d f5,f3,f4       | FDSmmmmmMW 7                  |
| s.d f5,v5(r1)        | FSDESSSSsMW 1                 |
| daddi r2,r2,-1       | FDSSSSsEMW 1                  |
| add.d f6,f4,f5       | FDSSSaaSMW 1                  |
| s.d f6,v6(r1)        | FSSSDEssMW 1                  |
| daddui r1,r1,8       | sssFDssEMW 1                  |
| bnez r2,loop         | FSSSSDEMW 0                   |
| halt                 | Fxxxx 0                       |
|                      | sFDEMW                        |

## 9 September 2011 -- Computer Architectures -- part 2/2

Name, Student ID (Matricola) .....

## **Question 2**

Consider the same loop-based program and assume the following architecture for a superscalar MIPS64 processor implemented with multiple issue and speculation:

- issues 2 instructions per clock cycle
- jump instructions require 1 issue
- commits 2 instructions per clock cycle
- timing for the following separate functional units:
  - i. 1 memory address unit: 1 clock cycle
  - ii. 1 integer ALU: 1 clock cycle
  - iii. 1 jump unit: 1 clock cycle
  - iv. 1 FP multiplier unit, which is pipelined: 6 stages
  - v. 1 FP divider unit, which is not pipelined: 10 clock cycles
  - vi. 1 FP arithmetic unit, which is pipelined: 2 stages
- branch prediction is always correct
- there are no cache misses
- there are 2 CDBs (Common Data Buses).

Complete the table reported below, showing the processor behavior for the first 2 iterations.
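
The issue and latency rules listed above can be sketched in a few lines of Python. This is only an illustrative model, not part of the exam text: the names `issue_cycles`, `cdb_cycle`, `LOOP_BODY`, and `LATENCY` are hypothetical. It assumes that "jump instructions require 1 issue" means a branch occupies an issue cycle by itself, and that a result is broadcast on a CDB in the cycle right after its functional unit finishes.

```python
# Hypothetical sketch of the issue stage and of result-broadcast timing.
ISSUE_WIDTH = 2  # "issue 2 instructions per clock cycle"

# Functional-unit latencies in clock cycles, taken from the list above.
LATENCY = {"int": 1, "fp_mul": 6, "fp_div": 10, "fp_add": 2}

# Opcodes of one loop iteration, in program order.
LOOP_BODY = [
    "l.d", "l.d", "div.d", "s.d", "l.d", "mul.d",
    "s.d", "daddi", "add.d", "s.d", "daddui", "bnez",
]


def issue_cycles(instructions, width=ISSUE_WIDTH, branches=("bnez",)):
    """Return the issue cycle of each instruction, in program order.

    Assumption: a branch is issued alone, so it closes any partially
    filled cycle and nothing else is issued together with it.
    """
    cycles = []
    cycle, used = 1, 0
    for op in instructions:
        if op in branches:
            if used > 0:            # close the partially filled cycle
                cycle += 1
            cycles.append(cycle)
            cycle, used = cycle + 1, 0
        else:
            if used == width:       # current cycle is already full
                cycle, used = cycle + 1, 0
            cycles.append(cycle)
            used += 1
    return cycles


def cdb_cycle(exe_start, unit):
    """Cycle in which a result reaches the CDB.

    Assumption: the broadcast happens in the cycle right after the
    unit has worked for LATENCY[unit] cycles starting at exe_start.
    """
    return exe_start + LATENCY[unit]


two_iterations = issue_cycles(LOOP_BODY * 2)
print(two_iterations[:12])     # first iteration
print(two_iterations[12:])     # second iteration
print(cdb_cycle(6, "fp_div"))  # a division that starts executing in cycle 6
```

Under these assumptions the second iteration starts issuing in cycle 8, because the branch of the first iteration takes cycle 7 for itself. The model ignores structural hazards on the functional units and data dependencies, which determine the EXE column and must still be worked out by hand.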

| # iteration | Instruction    | Issue | EXE | MEM | CDB x2 | COMMIT x2 |
|-------------|----------------|-------|-----|-----|--------|-----------|
| 1           | l.d f1,v1(r1)  | 1     | 2M  | 3   | 4      | 5         |
| 1           | l.d f2,v2(r1)  | 1     | 3M  | 4   | 5      | 6         |
| 1           | div.d f4,f1,f2 | 2     | 6D  |     | 16     | 17        |
| 1           | s.d f4,v4(r1)  | 2     | 4M  |     |        | 17        |
| 1           | l.d f3,v3(r1)  | 3     | 5M  | 6   | 7      | 18        |
| 1           | mul.d f5,f3,f4 | 3     | 17X |     | 23     | 24        |
| 1           | s.d f5,v5(r1)  | 4     | 6M  |     |        | 24        |
| 1           | daddi r2,r2,-1 | 4     | 5I  |     | 6      | 25        |
| 1           | add.d f6,f4,f5 | 5     | 24A |     | 26     | 27        |
| 1           | s.d f6,v6(r1)  | 5     | 7M  |     |        | 27        |
| 1           | daddui r1,r1,8 | 6     | 7I  |     | 8      | 28        |
| 1           | bnez r2,loop   | 7     | 8J  |     |        | 28        |
| 2           | l.d f1,v1(r1)  | 8     | 9M  | 10  | 11     | 29        |
| 2           | l.d f2,v2(r1)  | 8     | 10M | 11  | 12     | 29        |
| 2           | div.d f4,f1,f2 | 9     | 16D |     | 26     | 30        |
| 2           | s.d f4,v4(r1)  | 9     | 11M |     |        | 30        |
| 2           | l.d f3,v3(r1)  | 10    | 12M | 13  | 14     | 31        |
| 2           | mul.d f5,f3,f4 | 10    | 27X |     | 33     | 34        |
| 2           | s.d f5,v5(r1)  | 11    | 13M |     |        | 34        |
| 2           | daddi r2,r2,-1 | 11    | 12I |     | 13     | 35        |
| 2           | add.d f6,f4,f5 | 12    | 34A |     | 36     | 37        |
| 2           | s.d f6,v6(r1)  | 12    | 14M |     |        | 37        |
| 2           | daddui r1,r1,8 | 13    | 14I |     | 15     | 38        |
| 2           | bnez r2,loop   | 14    | 15J |     |        | 38        |
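
The COMMIT column can be cross-checked with a small model of in-order commit. The sketch below is a hedged illustration and not an official solution method: `commit_cycles` and `first_iteration_done` are hypothetical names. It assumes an instruction can commit no earlier than the cycle after it completes (CDB cycle for instructions that produce a result, EXE cycle for stores and the branch), that commit follows program order, and that at most 2 instructions commit per cycle. The completion cycles used as input are the first-iteration values read from the table.

```python
# Hypothetical sketch of in-order commit with a width of 2.
COMMIT_WIDTH = 2  # "handle 2 instructions commit per clock cycle"


def commit_cycles(done, width=COMMIT_WIDTH):
    """Return the commit cycle of each instruction, in program order.

    done[i] is the cycle in which instruction i completed: the CDB
    cycle for instructions that write a result, the EXE cycle for
    stores and branches (an assumption of this sketch).
    """
    commits = []
    for finished in done:
        cycle = finished + 1                     # commit after completion
        if commits:
            cycle = max(cycle, commits[-1])      # never before older instructions
            if commits.count(cycle) >= width:    # that cycle is already full
                cycle += 1
        commits.append(cycle)
    return commits


# Completion cycles of the first iteration, in program order:
# l.d, l.d, div.d, s.d, l.d, mul.d, s.d, daddi, add.d, s.d, daddui, bnez
first_iteration_done = [4, 5, 16, 4, 7, 23, 6, 6, 26, 7, 8, 8]

print(commit_cycles(first_iteration_done))
# [5, 6, 17, 17, 18, 24, 24, 25, 27, 27, 28, 28]
```

A store needs no separate check for its data operand here: the instruction that produces the value is older, so in-order commit already forces the store to wait for it.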